the unreasonable effectiveness of recurrent neural networks

andrej karpathy, 2015-05-21


Recurrent Neural Networks

  • the striking ratio of model simplicity to result quality
  • Sequences
    • Vanilla and Convolutional Neural Networks are deeply limited by (1) their fixed-size input (e.g. images) and output vectors (e.g. probabilities of different classes) and (2) their dependence on a fixed number of computational steps (e.g. the number of layers in a model).
      • Recurrent nets allow operations over sequences of vectors as inputs, outputs or, generally, both. They use a learned function to combine input and state vectors, effectively simulating running programs with certain inputs and some internal variables. This makes them Turing-Complete (!) and allows them to simulate arbitrary programs, with proper weights (!!). ((img layers 3))
    • training Vanilla Nets = optimization over functions; training Recurrent Nets = optimization over programs.
  • Sequential processing in the absence of sequences
    • Even if inputs/outputs are fixed vectors, they can be processed sequentially: models learn stateful programs that process fixed-size data (put bluntly, the RNN learns a program and runs it over the data).
  • RNN computation
    • :::: RNNs take an input vector x and produce an output vector y that is influenced not just by x but by the entire history of inputs fed in so far. The matrices are initialized with random numbers and training consists of finding the matrices that give rise to the desired behavior (measured with a loss function that expresses which output y is desired in response to input sequence x). (a minimal step-function sketch follows this block)
    • RNNs are neural nets, so they work much better when stacked: e.g. two separate RNNs, where the first receives the input vectors and the second receives the output of the first as its input. (Neither RNN knows nor cares, it's just gradients flowing through each module during backpropagation.)
    • ⤷ Long Short-Term Memory (LSTM) networks are slightly different and work better in practice thanks to a more powerful update equation and some appealing backpropagation dynamics.
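    • A minimal numpy sketch of the step function and stacking described above; the weight names (W_xh, W_hh, W_hy), the tanh update and all dimensions are illustrative assumptions, not the post's exact code:
      import numpy as np

      class VanillaRNN:
          # minimal vanilla RNN cell: new state h = tanh(W_hh·h + W_xh·x), output y = W_hy·h
          def __init__(self, input_size, hidden_size, output_size):
              # small random init; training would tune these three matrices
              self.W_xh = np.random.randn(hidden_size, input_size) * 0.01
              self.W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
              self.W_hy = np.random.randn(output_size, hidden_size) * 0.01
              self.h = np.zeros((hidden_size, 1))  # internal state carried across steps

          def step(self, x):
              # combine the current input with the history stored in the hidden state
              self.h = np.tanh(self.W_hh @ self.h + self.W_xh @ x)
              return self.W_hy @ self.h  # output vector y

      # stacking: the second RNN consumes the output vectors of the first
      rnn1 = VanillaRNN(input_size=4, hidden_size=8, output_size=8)
      rnn2 = VanillaRNN(input_size=8, hidden_size=8, output_size=4)
      x = np.zeros((4, 1)); x[0] = 1   # a 1-of-k encoded input vector
      y = rnn2.step(rnn1.step(x))      # a 2-layer forward pass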
  • Character-Level Language Models
    • We can use RNNs to model the probability distribution of the next character in a sequence, given the sequence of previous characters. This then allows us to generate new text one character at a time.
      • Suppose we want to train an RNN on the sequence "hello", with a vocabulary of four possible letters "h", "e", "l" and "o". The training sequence is then actually a source of 4 separate training examples: (1) "e" should be likely given "h" as input, (2) "l" should be likely given "he", (3) "l" should also be likely given "hel" and (4) "o" should be likely given "hell".
    • Each character is 1-of-k encoded (all 0s except for a 1 at the character's index in the vocabulary) and fed into the RNN sequentially. The RNN outputs 4D vectors (one dimension per character) representing confidence levels for the next character.
      • ((img weights)) increase in confidence of "e"
      • Since RNNs are entirely made of differentiable operations, we can run backpropagation (recursive application of calculus' chain rule) to adjust each weight and increase the scores of the correct targets (green bold numbers).
      • Feeding the same inputs to the RNN after a parameter update (which nudges every weight a tiny bit in the gradient direction) would show the correct characters ("e" in the first time step) with higher scores and the incorrect characters with lower ones. This is repeated until the network converges and its predictions become consistent with the training data, i.e. correct characters are always predicted next. (This essentially means the standard Softmax classifier (cross-entropy loss) is applied to every output vector simultaneously.)
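      • A minimal sketch of that per-step loss on the "hello" example; the RNN's 4 output score vectors are stand-ins (random placeholders) here, the point is only that softmax cross-entropy is applied at every time step:
        import numpy as np

        vocab = ['h', 'e', 'l', 'o']
        char_to_ix = {c: i for i, c in enumerate(vocab)}

        def softmax(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        # the sequence "hello" yields 4 (input, target) training pairs
        inputs, targets = "hell", "ello"
        # placeholder output scores from the RNN at each of the 4 time steps
        scores = np.random.randn(4, len(vocab))

        loss = 0.0
        for t, (x_ch, y_ch) in enumerate(zip(inputs, targets)):
            p = softmax(scores[t])                # confidences over the next character
            loss += -np.log(p[char_to_ix[y_ch]])  # cross-entropy at this time step
        print("total loss over the sequence:", loss)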
    • :::: Notice that the first time "l" is input, the target is also "l", but the second time, the target is "o": the RNN ∴ cannot rely on the input alone, it must use its recurrent connection to track context and achieve the task.
    • Feed a character into the RNN, get a distribution over likely next characters, sample from that distribution, and feed the sample right back in to get the next letter. Repeating this process samples text (!).
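    • A self-contained sketch of that sampling loop (the weights here are random/untrained, so the output is gibberish; with trained matrices the same loop generates text):
      import numpy as np

      vocab = ['h', 'e', 'l', 'o']
      V, H = len(vocab), 16
      rng = np.random.default_rng(0)
      W_xh = rng.standard_normal((H, V)) * 0.01   # untrained placeholder weights
      W_hh = rng.standard_normal((H, H)) * 0.01
      W_hy = rng.standard_normal((V, H)) * 0.01
      h = np.zeros(H)

      def softmax(z):
          e = np.exp(z - z.max())
          return e / e.sum()

      ix = 0                                      # start from 'h'
      out = [vocab[ix]]
      for _ in range(10):
          x = np.zeros(V); x[ix] = 1              # 1-of-k encode the current character
          h = np.tanh(W_hh @ h + W_xh @ x)        # update the hidden state
          p = softmax(W_hy @ h)                   # distribution over the next character
          ix = rng.choice(V, p=p)                 # sample from it
          out.append(vocab[ix])                   # the sample is fed back in next loop
      print("".join(out))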
  • RNN Training on different text datasets: predicting next character from input.
    • Temperature:
      • Lower temperature (<1, e.g. 0.5) → more confidence, safer predictions, repetitive.
      • Higher temperature (>1) → more diversity, more errors.
      • Near zero → repeats the most likely phrase endlessly.
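      • A minimal sketch of temperature-scaled sampling (divide the raw scores by T before the softmax; the names and values are illustrative):
        import numpy as np

        def sample_with_temperature(logits, T, rng):
            # T < 1 sharpens the distribution (safer, repetitive); T > 1 flattens it (diverse, error-prone)
            z = np.asarray(logits) / T
            p = np.exp(z - z.max())
            p /= p.sum()
            return rng.choice(len(p), p=p)

        rng = np.random.default_rng(0)
        logits = [2.0, 1.0, 0.1, -1.0]
        for T in (0.2, 1.0, 2.0):
            print(T, [sample_with_temperature(logits, T, rng) for _ in range(8)])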
    1. Paul Graham’s essays (~1MB, ~1m characters): a very small dataset with a lot of wisdom, but unfortunately PG is a relatively slow generator.
      • 2-layer LSTM, 512 hidden nodes (~3.5m parameters), dropout=0.5 after each layer, batch size=100, truncated backprop length=100 characters. One batch on a TITAN Z GPU ≈0.46s. (a rough model sketch follows at the end of this item)
        "The surprised in investors weren't going to raise money.I'm not the company with the time there are all interesting quickly,don't have to get off the same programmers.There's a super-angel round fundraising,why do you can do.If you have a different physical investment are become in people who reduced in a startup with the way to argument the acquirer could see them just that you're also the founders will part of users'affords that and an alternation to the idea.[2]Don't work at first member to see the way kids will seem in advance of a bad successful startup.And if you have to act the big company too."
      • RNN learns basic English structure and punctuation from scratch. Often nonsensical but shows glimmers of insight ("a company is a meeting to think to investors"), referencing arguments and citations ("[2]").
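      • A rough PyTorch analogue of that configuration (the original experiments used the Torch/Lua char-rnn code; the vocabulary size and the embedding layer here are assumptions):
        import torch
        import torch.nn as nn

        class CharLSTM(nn.Module):
            def __init__(self, vocab_size=96, hidden_size=512, num_layers=2, dropout=0.5):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, hidden_size)
                self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers,
                                    dropout=dropout, batch_first=True)
                self.head = nn.Linear(hidden_size, vocab_size)  # per-step next-char scores

            def forward(self, x, state=None):
                h, state = self.lstm(self.embed(x), state)
                return self.head(h), state

        model = CharLSTM()
        x = torch.randint(0, 96, (100, 100))  # batch of 100, truncated backprop length 100
        logits, _ = model(x)
        print(logits.shape)                   # torch.Size([100, 100, 96])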
    2. Shakespeare (4.4MB)
      • 3-layer RNN, 512 hidden nodes each.
      • ((img Shakespeare I & II))
      • After training, samples resemble Shakespeare-like character names and dialog structure, but still obviously generated.
    3. Wikipedia (100MB)
      • ((img wikipedia))
        Naturalism and decision for the majority of Arab countries' capitalide waby the Irish language by [[John Clair], [[An Imperial Japanese Revolt], with Guangzham's sovereignty. His generals were the powerful ruler of the in the [[Protestant Imineners], which could be said to be directly in Communication, which followed a ceremony and set inspired prison, training emperor travelled back to [[Antioch, Perth, October 25|21] to note, the Kof Costa Rica, unsuccessful fashioned the [[Thrales], [[Cynth's Dajoard] in western [[Scotland], near Italy to the conquest of India with the confCopyright was the succession of independence in the slop of Syrian influewas a famous German movement based on a more popular serviziocus, non-doct and sexual power post. Many governments recognize the military housing of [[Civil Liberalization and Infantry Resolution 265 National Party in Hunga that is sympathetic to be to the [[Punjab Resolution] (PJS) [http://www.humah.yahoo.com/guardian.cfm/7754800786d17551963s89.htm Official economics Adjoint for the Nazism, was swear to advance to the resources for those Socialism's rule, was starting to signing a major tripad of aid exile.]
      • Generates pseudo-Wikipedia text with proper "[[]]", hallucinated URLs, random but valid XML-like tags.
    4. Algebraic Geometry LaTeX (16MB)
      • Generates LaTeX source, but struggles to match environments correctly.
      • :::: The RNN chooses to leave some proofs out ("Proof omitted.", top left) and mixes up closing tags/environments.
    5. Linux Source Code (474MB)
      • Large 3-layer LSTM (~10M params) trained for days.
      • ((img code))
      • Generates plausible C code with comments, indentations, balanced brackets, and some valid syntax.
      • Common issues: inconsistent variable usage, returns in void functions, and random file restarts (like reprinting a license).
      • Sometimes the model decides that it’s time to sample a new file. This is usually a very amusing part: The model first recites the GNU license character by character, samples a few includes, generates some macros and then dives into the code ((img code docs gen))
    6. Baby Names (8k names)
      • One name per line as training data. Generates many plausible new names not in the training set.
        Rudi Levette Berice Lussa Hany Mareanne Chrestina Carissy Marylen Hammine Janye Marlise Jacacrie Hendred Romand Charienna Nenotto Ette Dorane Wallen Marly Darine Salina Elvyn Ersia Maralena Minoria Ellia Charmin Antley Nerille Chelon Walmor Evena Jeryly Stachon Charisa Allisa Anatha Cathanie Geetra Alexie Jerin Cassen Herbett Cossie Velen Daurenge Robester Shermond Terisa Licia Roselen Ferine Jayn Lusine Charyanne Sales Sanny Resa Wallon Martine Merus Jelen Candica Wallin Tel Rachene Tarine Ozila Ketia Shanne Arnande Karella Roselina Alessia
  • Training Evolution (War and Peace LSTM)
    • Iteration 100: random jumbles, some word spacing, no consistent punctuation.
      tyntd-iafhatawiardemot lytdws, tfti, astaifogoh e oaser rranbyn epliatklrgt oid oens, smtt hn e etieh, hregtrs nig tike, aoaens l ng
    • Iteration 300: quotes, periods, better spacing.
      "Tm on tth i they" for mes scrl iund Keushey. Thomh e res heu lke, an mere nith ol sivh I lalter then d Ble ip ilesh uwy filon a set erlo me coan iog ennc Ph el is mth ond hon at. Mei Dimo ro tion in ther thize."
    • Iteration 500: shortest common words (we, He, His, Which) learned.
      we counter. He stutn codes. His stanted out one of ler that concossions ato gear ang reay Jotrets and with fre colt otf paitt thin wall. Which das
    • Iteration 700: increasingly English-like text.
      Aftair fall unsuch that the hall for Prince Velzonski's that me ofher hear ly, and behsto so ar wage fiving were to it beloge, pavusay fallhow, and Gogition is so overelical and ofter.
    • Iteration 1200: uses quotes, question/exclamation marks, longer words.
      "Kite vouch!" he repeated by her door. "But I would be done and quarts, feeling, then, son is people...."
    • Iteration 2000: properly spelled words, natural quotations, names, and coherent phrases.
      "Why do what that day," replied Natasha, and wishing to himself the fact tprince, Princess Mary was easier, fed in had oftened him. Pierre aking his soul came to the packs and drove up his father-in-law wom."
  • :::: As training progresses, the model first learns basic structure (spaces, punctuation) and the shortest words, then gradually longer words and more complex patterns emerge. Long-term dependencies (topics, themes...) appear later.
    • There are stretches of characters where the model is extremely confident (e.g. inside "http://www" sequences). Some neurons get highly excited on the [[ ]] markup, but interestingly must wait for the second "[" to activate.
    • ((img next word prediction))
    • Some neurons activate only inside double brackets [[...]], or count how many [ have appeared. Others track progress within "www" sequences, or detect quotes.
      • ((img next word prediction II))
      • ((img next word neuron III))
    • :::: ~5% of cells learn meaningful, interpretable behaviors (e.g. quote detection), despite never being explicitly programmed to do so. This is one of the cleanest and most compelling examples of where the power of Deep Learning models (and more generally end-to-end training) comes from. (an inspection sketch follows the images below)
      • ((img neuron interpretable algorithms I & II))
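      • A sketch of the inspection procedure only (the cell index, weights and text below are untrained/illustrative; with trained weights some cells really do track quotes, brackets or URL progress):
        import numpy as np

        rng = np.random.default_rng(0)
        text = 'He said "look at [[this]] link: http://www.example.com" and left.'
        vocab = sorted(set(text))
        V, H, unit = len(vocab), 32, 7                   # 'unit' = an arbitrary cell to watch
        W_xh = rng.standard_normal((H, V)) * 0.1
        W_hh = rng.standard_normal((H, H)) * 0.1
        h = np.zeros(H)

        activations = []
        for ch in text:
            x = np.zeros(V); x[vocab.index(ch)] = 1
            h = np.tanh(W_hh @ h + W_xh @ x)
            activations.append(h[unit])                  # record this cell at every character

        # uppercase the characters where the watched cell fires strongly
        print("".join(c.upper() if a > 0.5 else c for c, a in zip(text, activations)))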
  • Further Reading & Applications
    • RNNs, like CNNs, have been around for decades but are only now recognized as powerful, largely thanks to increased computational resources.
    • NLP/Speech: transcribe speech→text, translate, generate handwritten text and, of course, language models (Sutskever et al.).
    • Computer Vision: classify video frames, caption images, answer visual questions.
    • Challenges:
      • Not always inductive: models sometimes just memorize sequences and show only weak signs of generalizing correctly.
      • Unnecessary coupling of representation size to the amount of computation per step, caused by the matrix multiplication: doubling the hidden state vector roughly quadruples the FLOPs per step, so growing the representation grows the computation. The goal would be to maintain a huge representation of intermediate state variables while keeping the computation per time step fixed.
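      • A back-of-the-envelope sketch of that coupling: the hidden-to-hidden multiply W_hh·h costs on the order of H² multiply-adds per step, so doubling H roughly quadruples the per-step cost (the input size below is an arbitrary illustration):
        def recurrent_flops(hidden_size, input_size=128):
            # ~2 ops per multiply-add, for W_hh (H×H) plus W_xh (H×input)
            return 2 * (hidden_size * hidden_size + hidden_size * input_size)

        for H in (512, 1024, 2048):
            print(H, recurrent_flops(H))  # grows ~4x each time H doubles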
    • DeepMind's Neural Turing Machines paper and attention-based models sketch a path towards achieving read/write operations between large, external memory arrays and a smaller set of memory registers (~working memory) where the computation happens. Very interesting memory addressing mechanisms implemented with a (soft, and fully-differentiable) attention model.
      • :::: the concept of soft attention might be the most interesting recent architectural innovation in neural networks.
      • Soft attention schemes for memory addressing are convenient since they keep the model fully differentiable, but efficiency is sacrificed because everything that can be attended to is attended to (but softly).
        • This would be like declaring a pointer in C that, instead of pointing to a specific address, defines an entire distribution over all addresses in memory and returns a weighted sum of the pointed-to contents when dereferenced (very expensive).
      • Attention is a key recent innovation. Soft attention is fully differentiable but expensive; hard attention (sampling a particular chunk of memory, i.e. a read/write action on one memory cell instead of reading/writing from all cells) is more efficient but non-differentiable.
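      • A minimal sketch of the soft vs. hard read (the memory layout, keys and sizes are illustrative assumptions):
        import numpy as np

        def soft_read(query, keys, memory):
            # soft attention: a differentiable, weighted sum over ALL memory rows
            scores = keys @ query                 # similarity of the query to every slot
            w = np.exp(scores - scores.max())
            w /= w.sum()                          # softmax -> a full distribution over addresses
            return w @ memory                     # "dereference" = weighted sum of contents

        rng = np.random.default_rng(0)
        N, D = 64, 16                             # 64 memory slots of width 16
        memory = rng.standard_normal((N, D))
        keys = rng.standard_normal((N, D))
        query = rng.standard_normal(D)

        soft = soft_read(query, keys, memory)     # touches every slot (soft, but expensive)
        hard = memory[np.argmax(keys @ query)]    # hard read: one slot, cheaper but non-differentiable
        print(soft.shape, hard.shape)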
